1 Introduction

In this report we briefly describe our preliminary work on reviewing the literature and on
early experiments in single-view morphing. In Section 2, we review the current research
literature on view morphing and matching for single views. In Section 3, we discuss a
feature-based approach to sparse view morphing that also shows promise for compressing
the model data corpus. In Section 4, we describe a new Bayesian approach (optimal in a
maximum a posteriori sense) that treats the problem as one of mid-level segmentation.
We present results of some preliminary experiments on feature-based morphing, as well as
a preliminary demonstration of the novel object representation.

2  Literature  Review 

View morphing has been a subject of scientific interest for well over a decade. Here
we review some of the recent advances in the field. Seitz and Dyer's [8] static
view morphing algorithm consists of four main steps: determining the fundamental matrix,
prewarping, morphing, and postwarping. First, eight or more corresponding points
were manually selected to determine a fundamental matrix, F, using a linear algorithm.
Next, the two original images were warped onto a plane parallel to the camera
baseline using epipolar geometry. Following this, a user manually specified a set of
corresponding line segments, and the Beier-Neely method was used to interpolate
correspondences. Linearly interpolating the two parallel views based on the disparity
map then yielded a parallel morphed view. To obtain a realistic final view,
a quadrilateral was used to determine the postwarping path for the final homography
projection using linear interpolation. However, the postwarping path obtained by linear
interpolation may cause a shrinking problem, as mentioned in [8, 10]. Manning and
Dyer [5] extended static view morphing to dynamic view morphing. However, in their
scenario, moving objects can only move along a straight line, and their motion is
limited to translation. Xiao et al. [10] relaxed this constraint and allowed arbitrary
motion. They showed that a rigid dynamic scene is equivalent to several static
scenes with different epipolar geometries based on relative motion. They also extended
their method to articulated object motion, such as walking and arm gestures, and
obtained photo-realistic results. Avidan and Shashua [1] proposed tri-view
synthesis using the trifocal tensor. In their framework, an arbitrary novel view can be
generated at any 3D view position from three small-baseline images, for which the
disparity can easily be determined by the Lucas-Kanade optical-flow method. The
Lucas-Kanade method, however, is almost certain to fail on wide-baseline images.
Pollard et al. [6] determined edge correspondences and used interpolation to generate
a new view over trinocular images. However, they cannot guarantee that the edge
correspondences are correct, and their disparity map, computed using the conventional
edge-scanline algorithm, is not clean. Their results therefore contain many
artifacts caused by incorrect correspondences. Vedula et al. [9] proposed view
interpolation over the spatio-temporal domain, where 14-17 fully calibrated cameras (with
small baseline) were placed on one side of the actor/actress to capture the events. In their
approach, they used voxel coloring, 3D scene flow, and a ray-casting algorithm to
synthesize the novel view from the original image sequences. They removed the background
layer and rendered only the actor/actress layers. Their results contain some visible
artifacts due to errors in shape estimation, scene flow, etc. Pollefeys and Van Gool [7]
combined 3D reconstruction and image-based rendering (IBR) to render a new view from a
sequence of images. They first determined the relative motion between consecutive images
and then recovered the structure of the scene. Next, employing unstructured light field
rendering, they generated a virtual view using view-dependent texture. Using this sequence
of small-baseline images, they accurately estimated a dense surface of the scene, which
noticeably improves the visual quality of their results. Recently, Zhang et al. [11] proposed
using feature-based morphing with light fields to obtain very realistic 3D morphing. In
their approach, a large number of images (hundreds of pictures) were taken of each
object using an array of calibrated cameras. Then, several feature polygons were manually
specified through a user interface. Using the corresponding feature polygons,
they generated a 4D light field and grouped the corresponding ray bundles for the reference
images. Finally, a novel view was synthesized using blending and warping functions on
the reference images.

3  Sparse  (Feature-Based)  View  Morphing 

In order to reason about the continuity of pose change in video, it is important to
estimate either the exact pose or some representation of it. In our proposal we described a
2D approach that does not require the extraction of explicit 3D pose and instead uses view
morphing to extract an image-based representation of the pose. We achieved success in
reconstructing intermediate views, as shown in Figure 2. However, it was evident that
manual work is inevitably required for visually plausible view morphing, which in itself
is antithetical to Automatic Target Recognition. Thus, instead of dense morphing, we
propose the use of feature-based matching: first, because detecting and matching salient
features is robust and reliable, and second, because it satisfies our requirement of
calculating likely poses, which can be leveraged for subsequent multi-view (video)
matching. To that end, we use SIFT features [4] for detection and matching.

Fig. 1. IR images and interpolated (synthesized) views.

Since the data set observes the object from all points of view, we constrain
the movement of any interest point to lie on an ellipse defined by the positions of that
interest point in the set of images. Figure 4 shows an interest point (marked by 'o') as
it moves along an elliptical curve. In this way we can build a view-independent model
of the object defined only by the ellipses constraining the positions of certain interest
points on the object (or target). Figure 5 shows paths created by the movement of 5
different points on the object as the view changes. According to the model, the equation
of the defining ellipse, evaluated at the position of its interest point, should equal zero.
The residue can be considered a measure of error. Figure 6 shows residues for one true
target and one false set of points. We are looking into estimating such ellipse-based
representations for IR images such as those in Figure 1 as well.
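
The residue test described above can be illustrated with a minimal sketch. It assumes a general conic fitted by ordinary least squares with the constant term normalized to 1 (so the curve must not pass through the image origin); the function names, the normalization, and the synthetic track are illustrative assumptions, not the fitting procedure used in our experiments.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_conic(points):
    """Least-squares fit of a*x^2 + b*x*y + c*y^2 + d*x + e*y = 1 (assumed normalization)."""
    rows = [[x * x, x * y, y * y, x, y] for x, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(5)] for i in range(5)]
    atb = [sum(r[i] for r in rows) for i in range(5)]
    return solve(ata, atb)

def residue(theta, point):
    """Absolute value of the conic equation at a point; zero for a point on the curve."""
    a, b, c, d, e = theta
    x, y = point
    return abs(a * x * x + b * x * y + c * y * y + d * x + e * y - 1.0)

# Hypothetical track of one interest point over 12 views (an ellipse centered at (5, 3)).
track = [(5 + 4 * math.cos(2 * math.pi * k / 12), 3 + 2 * math.sin(2 * math.pi * k / 12))
         for k in range(12)]
theta = fit_conic(track)
on_curve = max(residue(theta, p) for p in track)   # close to zero for the true target
off_curve = residue(theta, (5.0, 3.0))             # clearly non-zero for a false point
```

Points that belong to the modeled interest point give residues near zero, while a point off the curve gives a clearly larger value, which is the behavior the residue plot in Figure 6 is meant to show.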

4  Target  Recognition 

In this section, we describe the object segmentation approach. Rather than
treating an image as a collection of pixels arranged on a regular lattice, we assume a more
general abstraction of an image as a set of segments on a non-regular lattice. We first
describe the overall Bayesian object segmentation framework and then detail the
training and testing steps.

4.1  Bayesian  Framework 

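By Bayes' theorem,

\[ p(\mathcal{L} \mid x_1, x_2, \ldots, x_n) = \frac{p(x_1, x_2, \ldots, x_n \mid \mathcal{L})\, p(\mathcal{L})}{p(x_1, x_2, \ldots, x_n)}. \tag{1} \]

Assuming second-order dependency between adjacent segments, the likelihood term
can be rewritten as

\[ p(x_1, x_2, \ldots, x_n \mid \mathcal{L}) = \prod_i \prod_j p(x_i, x_j \mid \mathcal{L}) = \prod_i \prod_j p(x_i \mid x_j, \mathcal{L})\, p(x_j \mid \mathcal{L}). \tag{2} \]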
Fig. 2. View morphing from the COIL Data Set (synthesized images and original images).


If a smoothness prior is used, it can be enforced in the decision through a pairwise
interaction MRF prior. In particular, the Ising model is attractive for its
discontinuity-preserving properties,

\[ p(\mathcal{L}) \propto \exp\Big( \sum_i \sum_j \lambda \big( \ell_i \ell_j + (1 - \ell_i)(1 - \ell_j) \big) \Big), \tag{3} \]

where $\lambda$ is a positive constant and $i \neq j$ are neighbors as defined in the
region-adjacency graph. Ignoring constant terms, which do not affect the optimization,
the posterior term is then

\[ p(\mathcal{L} \mid x_1, x_2, \ldots, x_n) \propto \prod_i \prod_j p(x_i \mid x_j, \mathcal{L})\, p(x_j \mid \mathcal{L}) \exp\Big( \sum_i \sum_j \lambda \big( \ell_i \ell_j + (1 - \ell_i)(1 - \ell_j) \big) \Big). \tag{4} \]

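Taking the log of both sides and collecting terms,

\[ \log p(\mathcal{L} \mid x_1, \ldots, x_n) \propto \sum_i \sum_j \Big( \log p(x_i \mid x_j, \mathcal{L}) + \lambda \big( \ell_i \ell_j + (1 - \ell_i)(1 - \ell_j) \big) \Big) + \sum_j \log p(x_j \mid \mathcal{L}). \tag{5} \]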

Fig. 3. Several ATR results using the view morphing database in COIL, panels (a) and (b). The
red lines correspond to two queried images of the same object, which is correctly identified
and whose viewing angles are also estimated. The blue dotted lines correspond to two images
of the different objects 2 and 3, whose SADs do not converge to a small value.

The MAP estimate is the binary image that maximizes $L$, and since there are $2^{NM}$
possible configurations of $\mathcal{L}$, an exhaustive search is usually infeasible. In fact, it
is known that minimizing discontinuity-preserving energy functions is NP-hard in general.
Although various strategies have been proposed to minimize such functions, e.g. Iterated
Conditional Modes or Simulated Annealing, the solutions are usually computationally
expensive to obtain and of poor quality. Fortunately, since $L$ belongs to the
$\mathcal{F}^2$ class of energy functions, defined as a sum of functions of up to two binary
variables at a time,

\[ E(x_1, \ldots, x_n) = \sum_i E^{i}(x_i) + \sum_{i,j} E^{(i,j)}(x_i, x_j), \tag{6} \]

and since it satisfies the regularity condition of the so-called $\mathcal{F}^2$ theorem,
efficient algorithms exist for the optimization of $L$ by finding the minimum cut of a
capacitated graph. To maximize the energy function, we construct a graph
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with a 4-neighborhood system $\mathcal{N}$. In the graph,
there are two distinct terminals $s$ and $t$, the source and the sink, and $n$ nodes, one
for each image pixel location; thus $\mathcal{V} = \{v_1, v_2, \ldots, v_n, s, t\}$. A solution is
a two-set partition, $U = \{s\} \cup \{i \mid \ell_i = 1\}$ and $W = \{t\} \cup \{i \mid \ell_i = 0\}$.
The graph is constructed by adding a directed edge $(s, i)$ from $s$ to node $i$ with weight
$w_{(s,i)} = \tau_i$ (the log-likelihood ratio) if $\tau_i > 0$; otherwise a directed edge
$(i, t)$ is added between node $i$ and the sink $t$ with weight $w_{(i,t)} = -\tau_i$.
Undirected edges $(i, j)$ of weight $w_{(i,j)} = \lambda$ are added if the corresponding pixels
are neighbors as defined in $\mathcal{N}$ (in our case, if $j$ is within the 4-neighborhood
clique of $i$). The capacity of the graph is $C(\mathcal{L}) = \sum_i \sum_j w_{(i,j)}$, and a cut
is defined as the set of edges with a vertex in $U$ and a vertex in $W$. The minimum cut
corresponds to the maximum flow; thus maximizing $L(\mathcal{L} \mid x)$ is equivalent to finding
the minimum cut.

Fig. 4. Corresponding points lie on a conic (assuming an orthographic camera).

The minimum cut of the graph can be computed through a variety of approaches, such as the
Ford-Fulkerson algorithm or a faster version proposed by Greig et al. The configuration
found thus corresponds to an optimal estimate of $\mathcal{L}$.
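
The graph construction above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the maximum flow is found with a plain breadth-first augmenting-path search (Edmonds-Karp) rather than the faster variant mentioned above, and the function name `min_cut_labels`, the log-likelihood ratios `tau`, and the neighbor list are hypothetical inputs.

```python
from collections import deque

def min_cut_labels(tau, neighbors, lam):
    """Binary labels maximizing sum_i tau[i]*l_i - lam * (number of unequal neighbor pairs)."""
    n = len(tau)
    s, t = n, n + 1                      # source and sink terminals
    cap = [dict() for _ in range(n + 2)]  # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)        # reverse residual edge

    for i, r in enumerate(tau):
        if r > 0:
            add_edge(s, i, r)            # evidence for label 1
        else:
            add_edge(i, t, -r)           # evidence for label 0
    for i, j in neighbors:
        add_edge(i, j, lam)              # undirected smoothness edge, stored as
        add_edge(j, i, lam)              # two directed edges of weight lam

    def reachable_from_source():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if t not in parent:
            break                        # no augmenting path left: flow is maximal
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow

    # Nodes still reachable from s lie on the source side U and receive label 1.
    return [1 if i in parent else 0 for i in range(n)]

# Hypothetical 1D chain of five nodes.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
labels_smooth = min_cut_labels([2.0, 1.5, -0.2, 1.0, -3.0], chain, lam=0.5)
labels_no_prior = min_cut_labels([2.0, 1.5, -0.2, 1.0, -3.0], chain, lam=0.0)
```

With the smoothness weight set to zero each node simply follows the sign of its own ratio; with a positive weight, a node with weak evidence is pulled toward the label of its neighbors.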


4.2  Data  Model 

The training data for each object is first segmented using mean-shift segmentation [3],
a non-parametric clustering approach that abstracts each segment as the set of pixels
sharing a common mode in a feature space ($[l\ u\ v\ x\ y]$ is a popular choice). From these
segments, a region adjacency graph is produced, which is used first for training and
subsequently for recognition. To evaluate $p(x_i \mid \mathcal{L})$, kernel density estimation
on the $[l\ u\ v]$ space is employed, with each positive segment's mode providing a data
point in the feature space. To evaluate $p(x_i \mid x_j, \mathcal{L})$, we create a joint
feature space $[l^{(i)}\ u^{(i)}\ v^{(i)}\ l^{(j)}\ u^{(j)}\ v^{(j)}]$, which we populate with
every pair of adjacent regions in the training region adjacency graphs.
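
As a rough illustration of the data model, the sketch below builds a kernel density estimate from segment modes. The isotropic Gaussian kernel, the single bandwidth, the sample values, and the evaluation of the conditional term as the ratio of the joint density to the marginal density are all assumptions made for this example; the text above does not fix these choices.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function built from isotropic Gaussian kernels at each sample."""
    dim = len(samples[0])
    norm = len(samples) * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim

    def density(x):
        total = 0.0
        for s in samples:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
            total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
        return total / norm

    return density

# Hypothetical [l u v] modes of positive training segments.
modes = [(50.0, 10.0, 5.0), (52.0, 11.0, 4.0), (48.0, 9.0, 6.0)]
p_single = gaussian_kde(modes, bandwidth=2.0)

# Hypothetical adjacent-region pairs, stored as concatenated six-dimensional points.
pairs = [modes[0] + modes[1], modes[1] + modes[2], modes[1] + modes[0], modes[2] + modes[1]]
p_joint = gaussian_kde(pairs, bandwidth=2.0)

def p_conditional(x_i, x_j):
    """Assumed form: joint density of the pair divided by the marginal of x_j."""
    return p_joint(tuple(x_i) + tuple(x_j)) / p_single(x_j)
```

A segment whose mode lies near the training modes receives a high density, while a segment with very different color statistics receives a density close to zero.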
Fig. 5. Paths created by the movement of 5 different points on the object as view changes.

Fig. 6. Residues for one true target and one false set of points.

1 Introduction

In this report, we present our work on automatic target recognition using video, which is one of the major
tasks of the proposal. In Section 2, we first introduce a novel approach for adaptive video registration. In
this approach, the robust layers of a mission video sequence are automatically extracted and a layer mosaic
is generated for each layer, where the relative transformation parameters between consecutive frames
are estimated. Then, we formulate the image-registration problem as a region-partitioning problem, where
the overlapping regions between two images are partitioned into supporting and nonsupporting (or outlier)
regions. To determine the corresponding motion parameters, we estimate a set of sparse, robust correspondences
between the first frame and the reference image. Starting from corresponding seed patches, the aligned
areas are expanded to the complete overlapping areas for each layer using a graph-cut algorithm with a level
set representation. Next, we estimate the transformation parameters from the mosaic and align the remaining
frames in the video to the reference image. Finally, using the same partitioning framework, the registration
is further refined by adjusting the aligned areas and removing outliers.

In Section 3, we describe an approach for video-based object recognition that does not require any knowledge
of camera position or of the physical location of images with respect to each other. This approach involves a
sparse 2D model and object matching on the basis of video. The model is generated based on geometry and
image measurements only. We first identify the underlying topological structure of an image dataset and
represent it as a neighborhood graph. The graph is then refined by identifying redundant images and removing
them using view morphing. This gives a smaller dataset, leading to reduced space requirements and faster
matching. Finally, we exploit motion continuity in video and extend our algorithm to perform matching based
on video input, and we demonstrate that the results obtained using a video sequence are much more robust than
those obtained using a single image. Preliminary experimental results for both approaches are presented and discussed.

2  Video  Registration 

Image registration and alignment have been studied for a long time in different areas, including
photogrammetry, remote sensing, image processing, computer graphics, medical imaging, and computer vision [1-3].
Registration techniques can be classified based on the following two factors: the motion model between
mission and reference images, and the method of alignment [2].

The motion model depends on the geometry of the imaged scene and the dynamics of the sensor and object
motion. Given two images of a planar scene, a single motion model (affine or projective) can be fitted using
the existing registration approaches (Fig. 1(a)). For a scene containing multiple planes (or layers), it is difficult
to obtain correct registration using only two images (mission and reference) due to the inconsistent motion
model. Hence, the registration may overfit one layer or the layer boundaries may not be accurate [4]. However,
given a video sequence, an accurate layer segmentation can be obtained by exploiting spatiotemporal
information [5-8] (Fig. 1(b)), which makes it possible to perform layer-based registration.

Fig. 1. Depending on the scene, a video sequence can be represented by one layer (a) or multiple layers (b). (a) This
scene can be approximated by one plane due to the nonparallax camera motion and the generated mosaic. (b) "Flower
garden" sequence, which can be represented by three layers: tree, garden, and background.

Alignment methods can be broadly categorized into three classes: intensity-based (or appearance-based)
methods, feature-based methods, and hybrid methods. The intensity-based methods are based on the well-known
optical flow constraint equation [9], which can be solved by minimizing the sum of squares of pixelwise
differences (SSD). Generally, these methods are more useful for frame-to-frame registration of video frames
with a simple camera motion, where the pixel motion is small and the image intensities are similar [10, 11].
In the feature-based methods, the main steps include finding robust features, establishing correspondences,
fitting some transformation, and applying the transformation to warp the images [12, 13]. These methods are
relatively fast and more suitable for the registration of two dissimilar images with a large and complicated
motion or transformation. Recently, several hybrid methods have been proposed to integrate the merits of
intensity-based and feature-based methods [14, 15]. In these methods, a set of features is extracted, and then an
iterative optimization procedure is applied to the supporting regions around these features to minimize some
dissimilarity measure.

Currently, some registration problems, such as video mosaicing and registration of video acquired by an
airborne sensor to a reference image in the presence of camera information [14, 16, 17], have been solved
quite well. However, some problems in this area remain unresolved. First, how do we obtain a reliable initial
estimate of the motion parameters if the camera information (e.g. location, viewing angles, and sensor model)
is not available? In particular, if the camera locations and orientations are quite different, as in wide-baseline
images, the initial estimation is usually quite difficult. Second, how do we deal with outlier regions when
the images are taken at different times? These regions may look different due to appearing and disappearing
objects, such as moving objects, shadows, and vegetation. Therefore, only a part of the image may be useful
for the registration. Third, how do we handle complex motion models in a single 3D scene, such as the multiple
homographies shown in Fig. 1(b)? Most existing approaches ignore these problems and attempt to align the
whole image using a single motion model regardless of the number of layers.

With the aim of addressing the above limitations of current methods, we propose a novel framework
to perform video registration of a 3D scene, which can be approximated by multiple planes, without any
knowledge of the metadata. In particular, given an image sequence of a mission or inspection video, we want
to register it to a reference image, which may be taken at a different time, location and orientation. The
proposed approach first uses a motion layer extraction algorithm [8] to obtain an accurate layer segmentation
of the mission video by exploiting spatiotemporal information. For each layer, a mosaic is generated and
the relative transformation parameters between consecutive frames are estimated. Then, we formulate the
image registration problem as a partitioning problem, where the overlapping regions between two images
are partitioned into supporting and non-supporting regions for the registration. In this framework, a region
expansion process is designed to adaptively propagate the alignment process from the high-confidence seed
regions to the low-confidence areas and simultaneously remove outlier regions. In order to obtain such starting
seed regions, we apply a wide-baseline algorithm [18] to compute a set of reliable seed correspondences
between the first mission frame and the reference image. Then, starting from the seed regions, the initially aligned
areas are expanded to the whole overlapping areas using a graph-cut algorithm integrated with the level set
representation of the previous regions. Consequently, we achieve a robust layer alignment for each layer using
the relative motion parameters estimated by the layer mosaic, and the final multi-layer video registration is
obtained after back projection of layers.

Fig. 2. (a) Three frames from a mission video are shown on the left that are to be registered to a single layer in the
reference image shown on the right. (b) Three frames from a mission video are shown on the left that are to be registered
to two layers in the reference image shown on the right.

2.1  Single  Layer  Registration 

In  a  planar  scene,  only  one  layer  is  available,  as  shown  in  Fig.  2(a).  It  is  easy  to  generate  a  mosaic  for  this  layer 
using  an  affine  or  projective  motion  model.  However,  if  the  scene  contains  multiple  layers,  the  motion  models 
can  vary  from  a  simple  global  motion  model  to  multiple  motion  models,  where  pixel  motions  are  mapped 
to  several  parameter  clusters.  Figure  2(b)  shows  one  example  of  this  case  from  a  “door-wall”  sequence, 
which  contains  two  layers.  It  is  impossible  to  obtain  one  mosaic  using  this  mission  video  without  severe 
misalignment  or  distortion.  Fortunately,  in  the  context  of  video  registration,  temporal  information  is  available 
in  the  mission  video  sequence,  from  which  the  motion  layers  of  the  scene  can  effectively  be  extracted.  In 
this  paper,  we  use  a  multiframe  graph-cut  framework  [8]  to  achieve  an  accurate  layer  segmentation  of  the 
mission video sequence. After the motion segmentation, we obtain precise supporting regions for each layer
and the corresponding motion parameters between consecutive frames, which can be used as the initial
parameters for layer mosaicing.

Since the gap between consecutive frames of a video sequence is small, it is better to use an
intensity-based registration method to minimize the image residue (or SSD), which can be written as

\[ e = \sum_{x \in \Omega} \left[ I_2(H \cdot x) - I_1(x) \right]^2, \tag{1} \]

where $I_1$ and $I_2$ are the two original images, $\Omega$ is the overlapping area between the two consecutive
frames, $x \in \Omega$ are the homogeneous coordinates, and $H \in \mathbb{R}^{3 \times 3}$ is a homography matrix between
the two frames. Starting with $H = I$ (the identity matrix), a nonlinear approach, such as the Levenberg-Marquardt
method, can be used to iteratively minimize the residue [11]. In this method, after computing the image
gradient $\nabla I$ from the two images, a gradient-descent direction is estimated that leads to a local minimum.
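
To make the residue in Eq. (1) concrete, the following sketch evaluates it for a given homography. It is only an illustration: images are plain nested lists, the warped position is sampled by nearest-neighbor rounding rather than interpolation, and the function names and the tiny test images are hypothetical. The iterative minimization itself is not shown.

```python
def warp_point(h, x, y):
    """Apply a 3x3 homography to the pixel (x, y) given in homogeneous form (x, y, 1)."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w

def ssd_residue(img1, img2, h):
    """Sum of squared differences I2(H x) - I1(x) over the overlapping area.

    Returns the residue and the number of overlapping pixels.
    """
    total, count = 0.0, 0
    for y in range(len(img1)):
        for x in range(len(img1[0])):
            u, v = warp_point(h, x, y)
            ui, vi = int(round(u)), int(round(v))   # nearest-neighbor sampling (assumption)
            if 0 <= vi < len(img2) and 0 <= ui < len(img2[0]):
                diff = img2[vi][ui] - img1[y][x]
                total += diff * diff
                count += 1
    return total, count

# Hypothetical frames: frame2 is frame1 shifted one pixel to the right.
frame1 = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
frame2 = [[0] + row[:-1] for row in frame1]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shift_right = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]

residue_identity, _ = ssd_residue(frame1, frame2, identity)
residue_shift, overlap = ssd_residue(frame1, frame2, shift_right)
```

Starting from the identity gives a non-zero residue, whereas the correct one-pixel translation brings the residue over the overlapping area down to zero, which is the quantity an iterative optimizer would drive toward a minimum.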

The transformation $H_i$ between mission frame $i$ and the mosaic becomes known after a mosaic is
generated for each layer. Therefore, we have two choices when it comes to aligning the mission video to the
reference image, as shown in Fig. 3. In the first scheme, after aligning the layer mosaic to the reference image
with transformation $F$, an initial transformation from mission frame $i$ to the reference image can be computed
by $T_i = F H_i$. However, in this scheme, the error between frame $f_i$ and frame $f_1$ accumulates as
$i$ increases, which may not provide a good estimation between the layer mosaic and the reference image.

Fig. 3. The transformations among mission frames, reference image, and layer mosaic for one georegistration sequence.
$H_i$ is the transformation between mission frame $i$ and the mosaic. $F$ is the transformation between the mosaic and
the reference image. $T_i$ is the final transformation between mission frame $i$ and the reference image.

In our work, we use an alternative solution, whereby each frame $f_i$ is directly registered to the reference
image based on the previous transformation of frame $f_{i-1}$. First, we align the first frame to the reference
image and estimate its corresponding transformation $T_1$ by determining corresponding seed regions and using
the region expansion approach. Then the initial transformation for the second frame can be simply computed
by $T_2 = T_1 H_1^{-1} H_2$, which can be further refined using only the region expansion process, without computing
seed correspondences. After estimating the precise transformation $T_{i-1}$ of $f_{i-1}$, we iteratively compute the
initial transformation for frame $i$ by $T_i = T_{i-1} H_{i-1}^{-1} H_i$ using the previous frame $f_{i-1}$. As a result, we can
avoid the accumulated error of the mosaic since the initial transformation is always computed employing the
previous frame $f_{i-1}$ instead of the first frame $f_1$. Hence, before registering the whole mission video sequence,
we have to align frame $f_1$ to the reference image and compute $T_1$.
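
The recursion $T_i = T_{i-1} H_{i-1}^{-1} H_i$ can be sketched as follows. This is an illustrative outline only: the `refine` callback stands in for the region expansion step and defaults to doing nothing, and the helper names and the pure-translation test data are assumptions made for the example.

```python
def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def inverse(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]

def chain_transforms(t1, frame_to_mosaic, refine=lambda t, index: t):
    """Propagate frame-to-reference transforms: T_i = T_{i-1} * H_{i-1}^{-1} * H_i.

    t1 is the refined transform of the first frame; frame_to_mosaic lists H_1, H_2, ...
    refine is a placeholder for the region expansion refinement (hypothetical hook).
    """
    transforms = [t1]
    for idx in range(1, len(frame_to_mosaic)):
        initial = matmul(matmul(transforms[-1], inverse(frame_to_mosaic[idx - 1])),
                         frame_to_mosaic[idx])
        transforms.append(refine(initial, idx))
    return transforms

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

# Hypothetical data: frame k sits k pixels along x in the mosaic; frame 1 maps to (10, 5).
mosaic_h = [translation(1.0, 0.0), translation(2.0, 0.0), translation(3.0, 0.0)]
chained = chain_transforms(translation(10.0, 5.0), mosaic_h)
```

Because each initial transform is built from the previous frame's refined transform, an error in one frame is corrected by that frame's refinement instead of being carried through the mosaic to all later frames.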


2.2  Layer  Registration 

The  first  issue  that  has  to  be  tackled  for  layer  registration  algorithm  development  is  the  decision  to  use  either 
sparse  or  dense  image  features  for  registration.  Given  two  wide-baseline  images  without  any  metadata,  it  is 
difficult  to  perform  alignment  due  to  illumination  variations  and  large  motion  between  two  images.  Therefore, 
the  use  of  sparse  image  features  is  ideal  for  the  fast  estimation  of  initial  motion  parameters.  However,  due  to 
outliers  and  inaccuracy  of  these  correspondences,  the  initial  registration  is  usually  not  good  enough.  In  this 
section,  we  propose  a  two-stage  approach  to  integrating  the  merits  of  the  sparse  and  dense  image  features.  In 
the  first  stage,  we  determine  a  set  of  sparse  correspondences  between  the  mission  and  reference  images.  Then, 
starting  from  the  initial  seed  correspondences,  the  aligned  regions  are  gradually  expanded  to  cover  the  whole 
overlapping  areas  between  both  images.  At  the  same  time,  the  outlier  regions,  such  as  appearing/disappearing 
objects  that  may  harm  the  registration  process,  are  detected  and  removed. 

There  are  several  methods  for  computing  robust  correspondences  for  wide-baseline  images  [19,20].  Here 
we  make  use  of  our  previous  algorithm  [20]  to  determine  a  set  of  reliable  corresponding  corners.  In  this 
approach,  a  set  of  edge-corners  is  identified  in  both  images  that  provide  robust  and  consistent  matching 
primitives. Figure 4 shows the detailed matching process for a small patch. After computing the best affine
transformation and illumination coefficients between the two patches in the mission and reference images, the
patch in $I_2$ (Fig. 4(d)) is warped as shown in Fig. 4(e), which has a similar appearance to the patch in $I_1$
(Fig. 4(c)), and the residue is minimized.

Once  the  seed  correspondences  are  estimated  between  the  mission  and  reference  images,  our  purpose  is 
to  perform  registration  for  each  plane  (layer)  in  the  scene.  For  each  pair  of  correspondences,  we  consider 
a  small  patch  centered  around  each  seed  region,  which  can  be  approximated  as  a  planar  patch  (or  an  initial 
layer)  in  the  scene.  Therefore,  we  get  a  number  of  initial  layers,  and  each  layer  is  supported  by  a  small  square 
region  with  its  corresponding  affine  transformation.  This  implies  that  the  corresponding  small  square  regions 
in  the  mission  and  reference  images  are  aligned  by  this  initial  affine  transformation. 
Fig. 4. Determining correspondences between the mission frame and reference image. (a) Mission image. (b) Small
part of reference image. Several correspondences are computed by the wide-baseline matching algorithm; each pair of
correspondences is marked by squares with the same color. (c)-(e) Matching process of green (top row) and blue (bottom
row) corners. (c) A patch from (a). (d) Corresponding patch from (b). (e) Warped patch (d) obtained after applying the
best affine transformation, where patch (e) is similar to patch (c). NB: Compared to the original patches (c) and (d), the
illumination effect is partially compensated between (c) and (e) by estimating the illumination coefficients.


Nevertheless, this minimization process may create two problems. First, the parameters estimated
from the small patch may overfit the pixels inside the region and may not correctly represent the
global transformation of a larger region. Second, this process ignores objects that appear or disappear
between the two images, such as moving objects, occlusion areas, and shadows. To overcome these problems,
we expand the region boundary to obtain more supporting pixels that are consistent with the
motion parameters and also to identify the outlier pixels. Then we iteratively refine the motion parameters
using these supporting pixels. Therefore, this registration problem is essentially converted into a partitioning
problem that can be stated as follows: Determine the optimal supporting regions and their corresponding
motion parameters for image registration.

Our registration problem can be recast into the graph-cut framework. In this framework [1], we seek
the labeling function $f$ that partitions the pixels in region $\Omega$ into two groups: the first group represents the
supporting regions, labeled $f = 0$; the other represents the outlier regions, labeled $f = 1$. This partitioning
can be achieved by minimizing the following energy function:

\[ E = \sum_{(p,q) \in N} V(p, q) + \sum_{p \in \Omega} D_p(f_p) \qquad (2) \]

where the first term is a piecewise smoothness term, the second term is a data penalty term, $N$ is a 4-neighbor
system, and $f_p$ is the label of a pixel $p$. $D_p(f_p)$ can be approximated by a Heaviside function.
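As a minimal sketch of how the energy in Equation (2) could be evaluated for a candidate labeling, the Python below assumes a Potts-style smoothness term (a constant cost whenever two 4-neighbors take different labels) and a step-function data penalty on a per-pixel intensity residual. The names `heaviside_penalty`, `energy`, `threshold`, and `smooth_weight`, as well as the numeric values, are illustrative assumptions and do not come from the report.

```python
def heaviside_penalty(residual, label, threshold):
    """Data penalty D_p(f_p) as a step function of the intensity residual.

    Assumption: a pixel labeled 0 (supporting) is penalized when its residual
    exceeds the threshold, and a pixel labeled 1 (outlier) is penalized when
    its residual is at or below the threshold.
    """
    is_consistent = residual <= threshold
    if label == 0:
        return 0.0 if is_consistent else 1.0
    return 1.0 if is_consistent else 0.0


def energy(labels, residuals, threshold=10.0, smooth_weight=0.5):
    """Evaluate E = sum V(p, q) + sum D_p(f_p) on a 4-connected grid."""
    rows, cols = len(labels), len(labels[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            total += heaviside_penalty(residuals[r][c], labels[r][c], threshold)
            # Count each 4-neighbor pair once (right and down neighbors).
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                total += smooth_weight
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                total += smooth_weight
    return total


# Hypothetical residuals: the right-hand column does not fit the motion model.
residuals = [[2.0, 3.0, 40.0],
             [1.0, 2.0, 35.0]]
all_support = [[0, 0, 0], [0, 0, 0]]
split = [[0, 0, 1], [0, 0, 1]]
print(energy(all_support, residuals), energy(split, residuals))
```

Under these assumed weights, labeling the poorly fitting column as outliers lowers the energy, because the two smoothness penalties cost less than the two data penalties they remove.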

To minimize the energy function, a weighted graph $\mathcal{G} = (V, E)$ is constructed, where $V$ is a node set
(image pixels) and $E$ is a link set that connects the nodes. After assigning weights for the links, we can
compute a minimum cut $C$ using a standard graph-cut algorithm and partition the original region into the
supporting and outlier regions. However, using this process we cannot expand the region from the initial seed
patch to the exterior to obtain more supporting pixels. Hence, we must use the contour of the previous seed
region prior to computing the level set representation for this region [21, 22], which allows the region contour
to evolve along the normal direction. After enforcing the level set regulation on the sink-side weight of graph
$\mathcal{G}$, we can effectively control the graph-cut algorithm to gradually expand the seed region.
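The bipartitioning step can be illustrated with a small, self-contained s-t minimum cut. This is only a sketch: it uses a plain Edmonds-Karp max-flow rather than whichever graph-cut solver the authors used, and it assumes that pixels left attached to the sink are the supporting pixels (label 0), so that lowering a sink-side weight pushes a pixel toward the nonsupporting label. The function name `min_cut_labels` and all weights are hypothetical.

```python
from collections import deque


def min_cut_labels(num_pixels, source_w, sink_w, neighbor_links):
    """Label pixels by an s-t minimum cut (Edmonds-Karp max-flow).

    source_w[p]: weight of link (s, p); sink_w[p]: weight of link (p, t).
    neighbor_links: dict mapping (p, q) -> weight, one entry per pixel pair.
    Returns 0 for pixels on the sink side (supporting, by assumption) and
    1 for pixels on the source side (nonsupporting).
    """
    s, t = num_pixels, num_pixels + 1
    cap = [dict() for _ in range(num_pixels + 2)]

    def add_edge(u, v, w_uv, w_vu):
        cap[u][v] = cap[u].get(v, 0.0) + w_uv
        cap[v][u] = cap[v].get(u, 0.0) + w_vu

    for p in range(num_pixels):
        add_edge(s, p, source_w[p], 0.0)
        add_edge(p, t, sink_w[p], 0.0)
    for (p, q), w in neighbor_links.items():
        add_edge(p, q, w, w)  # undirected smoothness link

    def bfs_parents():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = bfs_parents()
        if t not in parent:
            break
        # Walk back from t to s, find the bottleneck, then push that flow.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow

    # After the last search, `parent` holds the nodes still reachable from s.
    return [1 if p in parent else 0 for p in range(num_pixels)]


# Five pixels in a row; the last one fits the motion model poorly.
source_w = [0.1, 0.1, 0.1, 0.1, 2.0]
sink_w = [2.0, 2.0, 2.0, 2.0, 0.1]
links = {(p, p + 1): 0.5 for p in range(4)}
labels = min_cut_labels(5, source_w, sink_w, links)
print(labels)
```

In this toy row the cut separates the last pixel from the rest, since severing its weak sink-side link and one smoothness link is cheaper than any other partition.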

Figure 5 shows a detailed expansion process starting from one initial seed region. Figures 5(a) and (b)
show the initial contours of the corresponding seed regions. Based on the initial contour of the original seed
region $R^0$ (Fig. 5(b)), we construct a mask $\beta$ of this region, which has a value in [0, 1], where the interior
pixels of the region are marked by 1 and the others are marked by 0. Then, a level set $\phi$ (Fig. 5(e)) can be
simply computed by convolving the region mask with a Gaussian kernel as: $\phi = G * \beta$, where the value
Fig. 5. Region expansion process. (a), (b) Initial corresponding patch contours in the reference and mission images,
respectively. (c) Final registration result, where the intensities of the embedded mission image are adjusted by illumination
coefficients $\mu$ and $\delta$. (d) Simple expansion and partitioning started from the initial contour shown in (a). (e) Level set
representation of initial contour (a). (f)-(h) Intermediate results using graph-cut method with the level set representation,
which can guarantee that the expansion gradually evolves from the center to a boundary. NB: The green boxes in (a) and
(b) are the initial seed regions. (f)-(h) Difference images between the warped (b) and (a); the green contours in (f)-(h)
are supporting region boundaries obtained after using a bipartitioning algorithm. The nonsupporting pixels are masked
by red.


of $\phi$ falls down along the contour normal direction until $\phi_p = 0$. Then, we warp the second image using
the corresponding homography and construct a graph $\mathcal{G}$ for the pixels with $\phi_p > 0$. After that, we apply
the level set $\phi$ to change the weight of the sink-side link for each pixel, such that the weights of the pixels
inside the region are almost unchanged while the weight of link $(p, t)$ decreases as the pixel $p$ moves away
from the boundary. As a result, the minimum cut $C$ is most likely to exclude the outside pixels and label
them as the nonsupporting pixels for this region. This way, the new expanded supporting region $R^1$ can
be computed as shown in Fig. 5(f). After several iterations, as shown in Figs. 5(f)-(h), the region's boundary
gradually propagates from the center to the exterior until it reaches the overlapping boundary of two images,
and the alignment is stable. Figure 5(h) shows the final region $R^5$ after five iterations, and Fig. 5(c) shows the
final registration results using the projective transformation computed by this approach. If several initial seed
regions share the same motion transformation for some layer, we expand the multiple regions simultaneously
to speed up the registration process. Figure 5 shows that our approach can obtain the piecewise smooth region
expansion, which is insensitive to noise. The outlier regions due to shadows are also detected and removed. At
the  same  time,  the  transformation  T\  for  the  key  frame  is  estimated.  After  applying  the  initial  transformation 
Ti  =  to  frame  fi,  we  initialize  the  alignment  of  frame  i  to  the  reference  image.  Then,  employing 

the  region  expansion  approach  to  the  ith  frame,  we  remove  outliers  and  refine  the  alignment  to  compute  the 
transformation  Ti  for  this  frame.  The  final  video  registration  results  are  shown  in  Fig.  6. 
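The level set construction used in the expansion step (a binary region mask smoothed by a Gaussian kernel, then applied to the sink-side weights) can be sketched in plain Python as below. The kernel width `sigma`, the truncation `radius`, the uniform starting weights, and the normalization by the peak of the smoothed mask are all assumptions made for illustration; the report does not specify them.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve each row with a 1D kernel, treating outside pixels as 0."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            sum(kernel[k + radius] * row[c + k]
                for k in range(-radius, radius + 1) if 0 <= c + k < n)
            for c in range(n)
        ])
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def level_set_from_mask(mask, sigma=1.5, radius=4):
    """Smoothed mask (Gaussian * mask), computed as two separable 1D passes."""
    kernel = gaussian_kernel(sigma, radius)
    return transpose(blur_rows(transpose(blur_rows(mask, kernel)), kernel))


# Binary mask of a 3x3 seed region inside an 11x11 window.
size = 11
beta = [[1.0 if 4 <= r <= 6 and 4 <= c <= 6 else 0.0 for c in range(size)]
        for r in range(size)]
phi = level_set_from_mask(beta)

# Hypothetical uniform sink-side weights, scaled by the normalized level set
# so that they stay near their original value inside the seed region and
# decay with distance from it.
sink_w = [[2.0] * size for _ in range(size)]
peak = phi[5][5]
modulated = [[sink_w[r][c] * phi[r][c] / peak for c in range(size)]
             for r in range(size)]
print([round(v, 3) for v in modulated[5]])
```

Feeding such modulated weights into the minimum cut makes distant pixels cheap to detach from the sink, which is one way to obtain the gradual, center-outward growth described for Figs. 5(f)-(h).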


2.3  Experiments 

We performed several experiments on different real data sets, where the metadata information was not available.
In all of the experiments, we applied the wide-baseline matching algorithm to estimate sparse correspondences,
which can provide an approximate initial alignment between mission and reference images.
For a single layer registration, after determining the sparse correspondences, we expanded these seed regions
simultaneously to speed up the alignment process. The initial homography between two images could be
computed in two ways: select the most robust affine transformation of the seed regions using the RANSAC
technique or estimate a homography voted on by all of these correspondences.
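The first option, choosing the seed affine transformation that is best supported by the other correspondences, can be sketched as a consensus-scoring step in the spirit of RANSAC. Because only a handful of seed regions are available, the sketch scores every candidate exhaustively instead of sampling at random. The affine parameterization, the pixel tolerance `tol`, and the function names are assumptions for illustration.

```python
def apply_affine(affine, point):
    """Map (x, y) with affine = (a, b, tx, c, d, ty)."""
    a, b, tx, c, d, ty = affine
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def count_inliers(affine, correspondences, tol):
    """Count correspondences (src, dst) mapped to within tol pixels of dst."""
    inliers = 0
    for src, dst in correspondences:
        px, py = apply_affine(affine, src)
        if (px - dst[0]) ** 2 + (py - dst[1]) ** 2 <= tol * tol:
            inliers += 1
    return inliers


def select_consensus_affine(seed_affines, correspondences, tol=2.0):
    """Return (index, support) of the seed affine with the most inliers."""
    scores = [count_inliers(a, correspondences, tol) for a in seed_affines]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical data: three correspondences agree on a shift of (10, 5);
# the fourth is a mismatch.
correspondences = [((0, 0), (10, 5)), ((20, 0), (30, 5)),
                   ((0, 20), (10, 25)), ((5, 5), (40, 40))]
seed_affines = [
    (1.0, 0.0, 10.0, 0.0, 1.0, 5.0),   # consistent with the majority
    (1.0, 0.0, 35.0, 0.0, 1.0, 35.0),  # fits only the mismatch
    (1.2, 0.0, 10.0, 0.0, 1.2, 5.0),   # wrong scale
]
best_index, support = select_consensus_affine(seed_affines, correspondences)
print(best_index, support)
```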

In  Fig.  7,  we  show  an  example  of  the  multi-seed  expansion  process.  Since  a  number  of  correspondences 
are  determined,  it  is  easy  to  estimate  a  robust  initial  homography  using  all  the  correspondences.  Then,  starting 
Fig. 6. Video registration results. (a) Mission video frames. (b) Registration results for several frames, where the mission
images are superimposed on the reference image. (c) Full registration of all mission video frames.


Fig. 7. Registration using multiseed region expansion. (a) Mission image. (b) Small part of a reference image. The
correspondences are marked in (a) and (b) by the same colors. (c) Initial seed regions and (d) corresponding level set
representation. (e) Final region contour after expansion, where the nonsupporting regions are indicated by red. (f) Registration
results. (g) Checkered display after alignment. (h), (i) Zoomed alignment results before and after applying the region
expansion alignment.


with  the  initial  homography,  we  expand  all  the  initial  seed  regions  simultaneously  until  the  overlapping  areas 
between  the  mission  and  reference  images  are  covered.  Our  graph-cut  algorithm  also  detects  and  removes  the 
outlier  regions,  most  of  which  are  due  to  vegetation  or  shadows.  Figures  7(h)  and  (i)  compare  the  zoomed 
results  before  and  after  applying  the  region  expansion  process. 

Figure 8 shows another set of results for geo-registration using single seed region expansion, where only
three correspondences are determined due to the small size of the mission frame. Since we cannot obtain a
good initial projective transformation from these few correspondences, we use RANSAC to determine the
robust affine transformation of the seed regions, which is shown in blue in Figs. 8(a) and (b). Then, starting
from one seed region (blue), we perform the adaptive region expansion alignment and obtain the registration
results as shown in Figs. 8(c) and (d).

Figure 9 shows the final registration results for the doorwall sequence. After obtaining the layers for each
frame, we align the different layers to the reference image separately using the adaptive region expansion approach.
The final registration results of the first frame are shown in Figs. 9(d) and (e). Compared to the direct
registration, our approach has two advantages. First, to align the corresponding layers, we employ different
sets of motion parameters to correctly represent the mapping of the pixels in these layers. Second, the layer
segmentation also provides accurate supporting regions for each layer, which prevent the region expansion
process from crossing the layer boundaries. Therefore, for each layer registration, our approach can effectively avoid
Fig. 8. Registration using one seed region expansion. (a) Small part of a reference image. (b) Mission image. (c) Registration
results. (d) Checkered display after alignment.


Fig.  9.  Multiple-layer  registration,  (a)  Reference  image,  (b),  (c)  Layers  of  door  and  wall  in  frame  1.  (d)  Registration 
results  of  frame  1.  (e)  Checkered  display  after  alignment.  Middle:  Some  mission  video  frames.  Bottom:  Corresponding 
video  registration  for  these  frames,  where  the  mission  images  are  split  into  two  parts  during  the  registration. 


the pixels from the other layers and achieve more accurate aligned regions for each layer. In all of our experiments,
after determining correspondences, the computational time for a single layer registration is less than
10 s per frame.


3  Model  Generation  for  Video-based  Object  Recognition 

In this section, we present a strategy for object recognition. Object recognition requires techniques that extract
information from the contents of images without human intervention and identify the objects present in
them. What makes this task difficult is the difference between how humans and software systems interpret images.
Most current systems operate on low-level features such as color or texture, extracted directly from
the pixels, while humans rely on the semantic meaning or high-level features of the image, such as the
objects or scene it contains [23]. Hence, one of the major tasks in image analysis is to identify the different
objects present in the scene. Another major difficulty arises because the operating conditions
may differ significantly from those of training and are not anticipated [24]. The major issues and needs of
object recognition include good representation of object models and backgrounds, adaptation to facet or
environment changes, good features for object representation, and efficient use of a priori knowledge about
object signatures [25].

There are a variety of approaches explored for object recognition, such as CAD-based, appearance-based, and
shape-based methods; however, each approach has its own set of limitations [26]. In each of the techniques,
a model is generated, which is then compared with an image to identify the object being tested. However,



Fig. 10. (a) Tessellating the viewing space. (b) Matching of feature points.


object recognition from a single view may fail when there is much similarity among test objects or when
background clutter or partial occlusion masks features of the object. Selinger et al. [27] used multiple fixed
cameras of known pose to apply a single-view object recognition system over a sequence of imagery. However,
they did not find any significant advantage of this approach. Later, Zhou et al. [28] successfully utilized
the temporal information present in video sequences for face recognition. They formulated a probabilistic
model merging the dynamics and identity of humans obtained from video. However, they assumed certain
constraints on the motion of persons while gathering their test data. Javed et al. [29] presented a probabilistic
framework for general object recognition using a video sequence containing different views of an object.
They generated a model for each object in the training set by capturing images at known camera viewing
angles and object poses.

In our approach, we use a set of reference images to generate an online sparse 2D model, estimate the
underlying topological structure, and arrange the images in the form of a connectivity graph. We refine the graph
using morphing to remove the redundant images, and finally use video matching for recognition of
objects. The strength of our approach is that we do not need to know the object pose beforehand, and the video
sequence can be shot over any arbitrary trajectory with objects following an unconstrained (but smooth)
path. The use of video rather than a single image increases the confidence measure of the match.


3.1  Model  Generation  For  Objects 

Any object can be modeled using either an object-centered or a view-centered representation [30, 31]. The
object-centered representations use features of the objects, such as boundary curves and surfaces, to describe
the volumes of space. View-centered representations, on the other hand, depend on the appearance of
objects from different viewpoints. These involve the use of aspect graphs and silhouettes for modeling. We
have used the view-centered representation to generate the database, which makes the task of matching
simpler. This is because the model no longer needs to be projected from 3D, and the features that are
to be compared are in 2D [31]. The input to our database generation algorithm is a set of reference images,
which have been arbitrarily extracted from a video sequence shot around an object. Our system tessellates the
images around the viewing space of the object, as shown in Fig. 10(a). The algorithm generates a neighborhood
graph, where each image is identified as a node and the links between neighbors are specified as edges.
The images are defined as neighbors on the basis of their logical proximity and the extent to which they match
each other.


Development of Neighborhood Graph. Given a set of reference images, we propose a novel approach to
tessellate them around the viewing space of the object while ensuring a minimal size of the database. The
algorithm begins by identifying the feature points in all the images of the repository. We have used the
Scale-Invariant Feature Transform (SIFT) operator to extract the distinctive features in the image. The features
are invariant to image scale and rotation, and robust to changes in viewpoint and illumination. Feature
correspondences are then identified using a fast nearest-neighbor algorithm [32]; these are ultimately used
to decide the presence or absence of linkage between nodes. Figure 10(b) shows SIFT points and matches
identified for a pair of images.



Fig.  11.  Neighborhood  Graph  for  a  car. 


For an image database originally having $N$ images, an $N \times N$ link matrix is formed. A link between
image pair $(I_i, I_j)$ is marked if they are found to be neighbors. The procedure for identification of neighbors
is two-fold. In the first pass, we find the average Euclidean distance $d$ for each image pair $(I_i, I_j)$. For $c$
corresponding points between two images, we have:
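A form consistent with this description, written under the assumption that $\mathbf{x}_k^{(i)}$ and $\mathbf{x}_k^{(j)}$ denote the $k$-th pair of matched feature vectors in $I_i$ and $I_j$, is the mean distance over the $c$ correspondences. The notation is introduced here for illustration only, and the vectors are left generic because the text does not say whether the distance is taken between image coordinates or between descriptors:

\[ d(I_i, I_j) = \frac{1}{c} \sum_{k=1}^{c} \left\lVert \mathbf{x}_k^{(i)} - \mathbf{x}_k^{(j)} \right\rVert_2 \]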


For  each  image,  the  pair  with  the  minimum  distance  is  selected  as  the  neighbor,  and  an  edge  is  marked 
between  them.  Considering  this  attribute  as  our  seed  point,  we  expand  the  region  to  include  all  those  images 
in  the  neighborhood  block,  whose  Euclidean  distance  falls  within  25%  of  the  minimum  value.  This  accounts 
for  the  out  of  plane  images  and  handles  arbitrary  viewpoints. 

In the second pass, we apply the physical proximity constraint between successive video frames. This implies
that two consecutive frames of a video sequence represent two images in proximity and hence represent
neighbors. Therefore:

\[ \mathrm{Neighbor}(I_i, I_{i+1}) = 1, \quad \forall\, i \in \text{Set of Frames} \qquad (4) \]

It may be noted that this second criterion improves the connectivity of the graph. In cases where the
image set is not from a true video sequence and represents an arbitrary collection of images, only the first
criterion is applied. Figure 11 shows a portion of a graph that is generated for a car. Such a graph is
generated for each object and stored as a model.
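The two-pass construction of the link matrix can be sketched as follows. The sketch takes the pairwise average distances as given and reads "within 25% of the minimum value" as $d \le 1.25\, d_{\min}$, which is an interpretation rather than something the text states; the function name `build_link_matrix` and the sample distances are hypothetical.

```python
def build_link_matrix(distances, is_video=True, margin=0.25):
    """Build a symmetric N x N 0/1 link matrix.

    distances[i][j] is the average Euclidean distance d between images
    I_i and I_j (None when no correspondences were found).
    """
    n = len(distances)
    links = [[0] * n for _ in range(n)]

    # First pass: link each image to its nearest image and to every image
    # whose distance is within `margin` of that minimum.
    for i in range(n):
        candidates = [(distances[i][j], j) for j in range(n)
                      if j != i and distances[i][j] is not None]
        if not candidates:
            continue
        d_min = min(d for d, _ in candidates)
        for d, j in candidates:
            if d <= (1.0 + margin) * d_min:
                links[i][j] = links[j][i] = 1

    # Second pass: consecutive video frames are neighbors (Equation 4).
    if is_video:
        for i in range(n - 1):
            links[i][i + 1] = links[i + 1][i] = 1
    return links


# Hypothetical average distances for four frames of a sequence.
d = [[None, 10, 12, 30],
     [10, None, 11, 40],
     [12, 11, None, 50],
     [30, 40, 50, None]]
links = build_link_matrix(d)
for row in links:
    print(row)
```

With `is_video=False` the second pass is skipped, which corresponds to the case of an arbitrary image collection where only the first criterion is available.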

Multi-view Morphing. Once the neighborhood graph is generated, it is refined using view morphing. Seitz
et al. [33] introduced view morphing to generate novel views from varying viewpoints using only two images.
Their approach is based on the principles of projective geometry, which can explicitly preserve 3D
information. Given sparse correspondences between the image pair, view morphing works by rectifying the
two images in such a manner that the corresponding points lie on the same scanline (a step known as prewarping).
This allows calculation of a disparity map, which helps in retrieving dense correspondences. Once
Figure 4. Removal of redundant node through morphing.



Fig.  12.  Updated  Neighborhood  Graph  after  Morphing. 


the dense correspondences are known, the morph is generated using cross dissolve, and the resulting image
is re-projected to its final position. Seitz's method, however, can generate new views only along the
line connecting the two original images. Later, Wexler et al. [34] extended the concept to tri-view morphing
and were able to synthesize morphs at any viewpoint within the boundaries of the triangle formed by the
three images.

Graph Pruning. A view-centered approach leads to a space requirement that is larger than that of an
object-centered representation [31]. This is because many characteristic features must be stored and there might
be an overlap among the images. Special attention must therefore be paid to keeping the database as small as
possible. We proceed by testing, for each image, whether it represents a morph of its neighbors. To test
an image $I_i$, we extract its two adjacent images $I_j$ and $I_k$, apply morphing to them to
generate features, and verify whether these represent the features originally extracted from $I_i$.

Given an image pair $(I_j, I_k)$ with corresponding feature points $p$ and $q$, we align the image pair to have
the corresponding points along corresponding scan lines and synthesize the features using Equation (5) for
varying values of $\alpha$:

\[ p_s = \alpha p + (1 - \alpha) q \qquad (5) \]

The features generated in this manner are compared with the original features extracted from $I_i$. For this,
we iteratively generate and compare $p_s$ for varying values of $\alpha$. If there exists an $\alpha$ for which $p_s$
represents the features of $I_i$, then $I_i$ could be generated using $I_j$ and $I_k$ and hence can be removed from
the dataset. This procedure is demonstrated in Fig. 11 and the updated graph is shown in Fig. 12. The process
continues until all the images in the database have been examined.

The same procedure is then repeated for images having a larger number of neighbors. The strength of this
technique is that we do not have to generate the intermediate images completely. Rather, we simply work on
the selected features of the images. This saves us from computing the disparity map, which is time-consuming.
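The redundancy test on sparse features can be sketched as a one-dimensional search over the interpolation weight. The sketch assumes that the features of the three images are already rectified and listed in matching order, samples the weight on a uniform grid, and declares an image redundant when the mean position error falls below a tolerance. The grid resolution `steps`, the tolerance `tol`, and the function names are illustrative assumptions.

```python
import math


def synthesize(p_feats, q_feats, alpha):
    """Interpolate matched feature positions: p_s = alpha p + (1 - alpha) q."""
    return [(alpha * px + (1.0 - alpha) * qx, alpha * py + (1.0 - alpha) * qy)
            for (px, py), (qx, qy) in zip(p_feats, q_feats)]


def mean_error(a_feats, b_feats):
    """Mean Euclidean distance between two equally ordered feature lists."""
    return sum(math.hypot(ax - bx, ay - by)
               for (ax, ay), (bx, by) in zip(a_feats, b_feats)) / len(a_feats)


def is_redundant(target_feats, p_feats, q_feats, steps=20, tol=1.0):
    """Return (redundant, best_alpha) from a grid search over alpha in [0, 1]."""
    best_alpha, best_err = None, float("inf")
    for k in range(steps + 1):
        alpha = k / steps
        err = mean_error(synthesize(p_feats, q_feats, alpha), target_feats)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_err <= tol, best_alpha


# Hypothetical rectified features in the two neighbors I_j and I_k.
p = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
q = [(4.0, 0.0), (14.0, 0.0), (4.0, 10.0)]
halfway = [(2.0, 0.0), (12.0, 0.0), (2.0, 10.0)]   # lies on the morph path
shifted = [(2.0, 5.0), (12.0, 5.0), (2.0, 15.0)]   # does not
print(is_redundant(halfway, p, q))
print(is_redundant(shifted, p, q))
```

Only feature positions are interpolated here, so no dense disparity map or intermediate image is ever formed, which mirrors the cost saving described above.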

3.2  Video-Based  Object  Recognition 

One way to identify the target image is to generate a massive dataset of virtual views using morphing and
compare the test image with all of them. This is inefficient and computationally expensive. We propose to
initially match the test image with only those images stored in the database. This helps in identifying an
approximate neighborhood of the image being examined. Once a seed image is found, the virtual images
around it can be generated using the morphing approach and compared with the test image. Because we
do not generate the whole image but work only with the sparse features detected by the feature
detector, the disparity map generation step is eliminated and the speed of our system increases.

In order to further strengthen the confidence measure of our detection results, we have used video sequences
instead of a single image for target recognition. The major advantage of this technique is that the


Fig.  13.  Using  Video  Sequence  for  Matching. 


Fig.  14.  Linear  Graph  Generated  for  an  Object  of  COIL  dataset. 


video provides information from multiple views. Many objects in the real world look alike when observed from a
particular viewpoint and completely different when observed from another. Using a
video for object recognition, we can exploit the fact that two adjacent images in the video sequence
represent nearby views of the object. Hence, the adjacent frames of the video sequence should
point to the same (or adjacent) nodes of the neighborhood graph. Thus, a correct identification results in a
smooth transition across the multiple images, following an unbroken trajectory in the model. On the other
hand, an incorrect match results in jitter across the multiple frames, which helps in identifying the incorrect
matches. The topological structure that our approach builds over the images in the database makes the model easy
to traverse while using a video sequence. As shown in Fig. 13, given the stored networks of objects and a test
video sequence, only the correct object follows a smooth trajectory along the graph, while the others suffer from
discontinuities.
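The trajectory argument can be made concrete with a small scoring sketch: for each stored model, count how many consecutive-frame transitions jump to a node that is neither the same node nor one of its graph neighbors, and prefer the model with the fewest jumps. The per-frame best-matching nodes are taken as given, and the ring-shaped graphs, object names, and function names are hypothetical.

```python
def ring(n):
    """Adjacency of a ring graph with n nodes (a stand-in for a stored model)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def count_discontinuities(adjacency, matched_nodes):
    """Count frame-to-frame transitions that leave the node's neighborhood."""
    breaks = 0
    for u, v in zip(matched_nodes, matched_nodes[1:]):
        if u != v and v not in adjacency[u]:
            breaks += 1
    return breaks


def recognize(models, per_frame_matches):
    """Pick the object whose matched nodes form the smoothest trajectory.

    models: dict name -> adjacency dict (node -> set of neighbor nodes).
    per_frame_matches: dict name -> best-matching node for each video frame.
    """
    scores = {name: count_discontinuities(models[name], per_frame_matches[name])
              for name in models}
    best = min(scores, key=scores.get)
    return best, scores


models = {"motorbike": ring(6), "truck": ring(6)}
per_frame_matches = {
    "motorbike": [0, 1, 1, 2, 3],  # moves along adjacent nodes
    "truck": [0, 3, 1, 5, 2],      # jumps around the graph
}
best, scores = recognize(models, per_frame_matches)
print(best, scores)
```

A practical system would presumably combine this count with the per-frame match quality; the sketch isolates only the smoothness cue.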

3.3  Experiments 

To evaluate our approach for target recognition, we used the Columbia Object Image Library (COIL-100) data
set from Columbia University and the VIVID data set from DARPA. In COIL there are 72 images each of 100 objects.
The viewing angle between these images is uniform, and this leads to a fairly linear neighborhood graph. See
Fig. 15. (a) A portion of original neighborhood graph and (b) its pruned network.


Fig. 14 for the neighborhood graph generated for a pickup. In order to capture the randomness of real-world
image capture, we shot our own video sequences following arbitrary trajectories.

Figure 15 shows a portion of the neighborhood graph of one of the objects and the updated network after pruning.
Experiments show that our algorithm could generate the neighborhood graph with a precision of 97.86%.
The system was able to reduce the image base to about 60% of its original size.

Video-based matching improved on the results obtained from single image matching. Single image matching
gave 40% correct matches, while video-based recognition gave about 80% correct matches. The reason for
the 20% incorrect matches is the high similarity of different objects at certain poses, which further supports
our approach of using videos instead of a single image for matching. Figure 16(a) shows a
smooth trajectory for the correct identification of a motorbike. Figure 16(b) identifies an incorrect matching of
a Humvee with the green truck by pointing out discontinuities.
1  Summary

In this report, we present a novel method for object class detection which is based on 3D object
modeling. Instead of using a complicated mechanism for relating multiple 2D training views, our
method establishes spatial connections between these views by mapping them directly to the surface
of a 3D model. The 3D shape of an object is reconstructed by using a homographic framework from
a set of model views around the object and is represented by a volume consisting of binary slices.
Features are computed in each 2D model view and mapped to the 3D shape model using the same
homographic framework. We also present our work on object recognition based on correlation using
a morphing technique. There has been considerable interest in using correlators for pattern recognition.
Correlators are inherently shift-invariant, allowing us to locate patterns (such as moving targets) in
the input scene merely by locating the correlation peak. Thus, we do not need to segment or register
the images prior to correlation, as we have to do in alternative methods for pattern recognition. In
this report, we describe two new methods for synthesizing new views of a known object so that the
occluded features of the object can be inferred and incorporated into the recognition process.


2  Model  based  Object  Class  Detection 

The key challenge of object detection is the ability to recognize any member of a category of
objects in spite of wide variations in visual appearance due to geometrical transformations, change
in viewpoint, or illumination. To deal with these challenges, we developed a novel 3D feature model
based object class detection method. Our objective is to detect the object given an arbitrary 2D view
using a general 3D feature model of the class. Here the objects can be arbitrarily transformed (with
translation and rotation), and the viewing position and orientation of the camera are arbitrary as well.
In addition, camera parameters are assumed to be unknown.

Object detection in such a setting has been considered a very challenging problem due to various
difficulties of geometrically modeling relevant 3D object shapes and the effects of perspective projection.
In our work, we exploit a recently proposed 3D reconstruction method using a homographic
framework for 3D object shape reconstruction. Given a set of 2D images of an object taken from
different viewpoints around the object with unknown camera parameters, which are called model
views, the 3D shape of this specific object can be reconstructed using the homographic framework
proposed in [10]. In our method, 3D shape is represented by a volume consisting of binary slices
with 1 denoting the object and 0 for background. By using this method, we can not only reconstruct
3D shapes for the objects to be detected, but also have access to the homographies between the
2D views and the 3D models, which are then used to build the 3D feature model for object class
detection.

In  the  feature  modeling  phase  of  our  method,  SIFT  features  [12]  are  computed  for  each  of 
the  2D  model  views  and  mapped  to  the  surface  of  the  3D  model.  Since  it  is  difficult  to  accurately 
relate  2D  coordinates  to  a  3D  model  by  projecting  the  3D  model  to  a  2D  view  (with  unknown 
camera  parameters),  we  propose  to  use  a  homography  transformation  based  algorithm.  Since  the 
homographies  have  been  obtained  during  the  3D  shape  reconstruction  process,  the  projection  of  a 
3D  model  can  be  easily  computed  by  integrating  the  transformations  of  slices  from  the  model  to  a 
particular  view,  as  opposed  to  directly  projecting  the  entire  model  by  estimation  of  the  projection 
matrix.  To  generalize  the  model  for  object  class  detection,  images  of  other  objects  of  the  class  are 
used  as  supplemental  views.  Features  from  these  views  are  mapped  to  the  3D  model  in  the  same 
way  as  for  those  model  views.  A  codebook  is  constructed  from  all  of  these  features  and  then  a  3D 
feature  model  is  built.  The  3D  feature  model  thus  combines  the  3D  shape  information  and  appearance 
features  for  robust  object  class  detection. 

Given a new 2D test image, correspondences between the 3D feature model and this testing view
are identified by matching features. Based on the 3D locations of the corresponding features, several
hypotheses of viewing planes can be made. For each hypothesis, the feature points are projected to
the viewing plane and aligned with the features in the 2D testing view. A confidence is assigned to
each hypothesis, and the one with the highest confidence is then used to produce the object detection
result.


2.1  Research  Background  and  Related  Works 


While approaches for recognizing an object class from some particular viewpoint or detecting a
specific object from an arbitrary view are advancing toward maturity [3, 9, 11], solutions to the
problem of object class detection using multiple views are still relatively far behind. Object detection
can be considered even more difficult than classification, since it is expected to provide accurate
location and size of the object.

Researchers in computer vision have studied the problem of multi-view object class detection,
resulting in successful approaches along two major directions. One path attempts to use an increasing
number of local features by applying multiple feature detectors simultaneously [1, 6, 13-15]. It has
been shown that recognition performance benefits from more feature support.
However, the spatial connections of the features in each view and/or between different views have
not been pursued in these works. These connections can be crucial in object class detection tasks.
Recently, much attention has been drawn to the second direction, which relates multiple views for object
class detection [5, 7, 8]. The early methods apply several single-view detectors independently and
combine their responses via some arbitration logic. Features are shared among the different
single-view detectors to limit the computational load. Most recently, Thomas et al. [16] developed a
single integrated multi-view detector that accumulates evidence from different training views. Their
work combines a multi-view specific object recognition system [9] and the Implicit Shape Model
for object class detection [11], where single-view codebooks are strongly connected by the exchange
of information via sophisticated activation links between each other.

Here we introduce a unified method to relate multiple 2D views based on 3D object modeling.
The main advantage of this method is a novel, efficient object detection system capable of
recognizing and localizing objects from the same class under different viewing conditions. Because the
3D locations of the features are considered during detection, better accuracy is obtained.
2.2  Object  Detection  using  3D  Shape  Model

3D Shape Model. Let $I_i$ denote the foreground likelihood map (where each pixel value is the
likelihood of that pixel being foreground) in the $i$th view of a total of $M$ views. Considering a reference
plane $\pi_r$ in the scene with homography $H_{\pi_r,i}$ from the $i$th view to $\pi_r$, warping $I_i$ to $\pi_r$ gives the
warped foreground likelihood map:

$$\hat{I}_{i,r} = [H_{\pi_r,i}]\, I_i. \qquad (1)$$

The visual hull intersection on $\pi_r$ (AND-fusion of the shadow regions) is achieved by multiplying
these warped foreground likelihood maps:

$$\theta_r = \prod_{i=1}^{M} \hat{I}_{i,r}, \qquad (2)$$

where $\theta_r$ is the grid of object occupancy likelihoods on plane $\pi_r$. Each value in $\theta_r$ gives the likelihood
of that grid location being inside the body of the object, so $\theta_r$ represents a slice of the object cut
out by plane $\pi_r$. It should be noted that, due to the multiplication step in (2), locations outside the
visual hull intersection region are penalized and thus have a much lower occupancy likelihood.
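
The fusion in (1) and (2) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the nearest-neighbour sampling, and the convention that each homography maps view pixels (x, y, 1) to plane grid cells are all assumptions made for illustration.

```python
import numpy as np

def warp_to_plane(likelihood, H_view_to_plane, out_shape):
    """Warp one foreground likelihood map onto the plane grid, as in (1).

    Assumption: H_view_to_plane maps homogeneous view pixels (x, y, 1) to
    plane grid coordinates. Sampling is nearest neighbour for simplicity.
    """
    rows, cols = out_shape
    ys, xs = np.mgrid[0:rows, 0:cols]
    plane_pts = np.stack([xs.ravel(), ys.ravel(), np.ones(rows * cols)])
    # Inverse mapping: locate the view pixel that lands on each plane cell.
    src = np.linalg.inv(H_view_to_plane) @ plane_pts
    src_x = np.rint(src[0] / src[2]).astype(int)
    src_y = np.rint(src[1] / src[2]).astype(int)
    h, w = likelihood.shape
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros(rows * cols)  # cells seen by no pixel get likelihood 0
    warped[valid] = likelihood[src_y[valid], src_x[valid]]
    return warped.reshape(out_shape)

def occupancy_slice(likelihoods, homographies, out_shape):
    """Multiply the warped maps of all views, as in (2)."""
    theta = np.ones(out_shape)
    for I_i, H_i in zip(likelihoods, homographies):
        theta *= warp_to_plane(I_i, H_i, out_shape)
    return theta
```

Because the maps are multiplied, a grid cell that falls outside the silhouette in any single view receives a near-zero occupancy likelihood, which is the AND-fusion behaviour described in the text.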

The grid of object occupancy likelihoods can be computed at an arbitrary number of planes
in the scene with different heights, each giving a slice of the object. Naturally, this does not apply to
planes that do not pass through the object's body: visual hull intersection on these planes will
be empty, so a separate check is not necessary.

Let $v_X$, $v_Y$, and $v_Z$ denote the vanishing points for the X, Y, and Z directions, respectively, and
let $l$ be the normalized vanishing line of the reference plane in the XYZ coordinate space. The reference
plane to image view homography can be represented as

$$H_{ref} = [v_X \;\; v_Y \;\; l]. \qquad (3)$$

Supposing that another plane $\pi$ has a translation of $z$ along the reference direction Z from the
reference plane, it is easy to show that the homography from plane $\pi$ to the image view can be computed
by

$$\hat{H}_\pi = [v_X \;\; v_Y \;\; \alpha z v_Z + l] = H_{ref} + [0 \mid \alpha z v_Z], \qquad (4)$$

where $\alpha$ is a scaling factor and $0$ is a 3 × 2 block of zeros. The image to plane homography $H_\pi$ is obtained by inverting $\hat{H}_\pi$.
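
As a rough illustration of (3) and (4), the sketch below builds the plane to image homography for a plane at height z by adding the scaled vanishing point of the Z direction to the third column of the reference homography. The function names and argument layout are hypothetical, not taken from the report.

```python
import numpy as np

def plane_to_image_homography(H_ref, v_Z, alpha, z):
    """Homography from the plane at height z to the image, following (4).

    H_ref is the 3x3 reference-plane-to-image homography [v_X v_Y l],
    v_Z the vanishing point of the Z direction, alpha the scaling factor.
    """
    offset = np.zeros((3, 3))
    offset[:, 2] = alpha * z * np.asarray(v_Z, dtype=float)  # [0 | alpha z v_Z]
    return np.asarray(H_ref, dtype=float) + offset

def image_to_plane_homography(H_ref, v_Z, alpha, z):
    """Image-to-plane homography, obtained by inverting the one above."""
    return np.linalg.inv(plane_to_image_homography(H_ref, v_Z, alpha, z))
```

Only the third column changes from plane to plane, so once the reference homography is known, the homography of every parallel slice follows from a single vector addition.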

Starting with a reference plane in the scene (typically the ground plane), visual hull intersection
is performed on successive parallel planes in the up direction along the body of the object. The
occupancy grids $\theta_i$ are stacked up to create a three dimensional data structure $\Theta = [\theta_1; \theta_2; \ldots; \theta_M]$.
$\Theta$ represents a discrete sampling of a continuous occupancy space encapsulating the object shape.
Object structure is then segmented out from $\Theta$ by dividing the space into the object and background
regions using the geodesic active contour method [2]. By using the above homography based
framework, 3D models for different objects can be constructed. In our method, not only the 3D shape of
the target object is exploited, but also the appearance features. We relate the features with the 3D
model to construct a feature model for object class detection.

The  features  used  in  our  work  are  computed  using  the  SIFT  feature  detector  [12].  Feature  vectors 
are  computed  for  all  of  the  training  images.  In  order  to  efficiently  relate  the  features  computed  from 
different  views  and  different  objects,  all  the  detected  features  are  attached  to  the  3D  surface  of  the 
previously built model. By using the 3D feature model, we avoid storing all the 2D training views,
so there is no need to build complicated connections between the views. The spatial relationships
between the feature points from different views are readily available and can be easily retrieved
when matched feature points are found.

The features computed in 2D images are attached to the 3D model by using the novel
homographic framework. Instead of directly finding the 3D location of each 2D feature, we map the 3D
points from the model's surface to the 2D views and find the corresponding features. Our method
does not require the estimation of a projection matrix from the 3D model to a 2D image plane, which is
a non-trivial problem. Instead, the model is transformed to the
various image planes using homographies. Since the homographies between the model and the image
planes have already been obtained during the construction of the 3D model, we are able to map the
3D points to 2D planes using homography transformations.

Fig. 1. Illustration of the equivalence of 3D to 2D projection and plane transformation using
homographies. (a) Volume projection: a 2D view of a 3D volume V is generated by projecting the volume
onto an image plane. (b) Plane transformation: the same view can be obtained by integrating the
transformation of each slice in the volume to the image plane using homographies.

In our work, a 3D shape is represented by a binary volume $V$, which consists of $K$ slices $S_j$,
$j \in [1, K]$. As shown in Fig. 1(b), each slice of the object is transformed to a 2D image plane by
using the corresponding plane to image homography in (4). The transformed slice accounts for a small patch of
the object projection. Integrating all $K$ patches together produces the whole projection of the 3D object in
the 2D image plane. In this way, we obtain the model projection by using a series
of simple homography transformations, and the hard problem of estimating the projection matrix of
a 3D model to a 2D view is avoided.
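
The slice-by-slice projection described above might be sketched as follows. This is an illustrative approximation only: the function name is hypothetical, slice grid coordinates are assumed to be the plane coordinates expected by each homography, and simple forward mapping with rounding is used, which can leave small holes that a real implementation would need to fill.

```python
import numpy as np

def project_volume(volume, homographies, image_shape):
    """Project a binary volume to an image by transforming each slice.

    volume:       boolean array of shape (K, rows, cols), one slice per plane.
    homographies: K plane-to-image 3x3 matrices, one per slice.
    image_shape:  (height, width) of the output projection mask.
    """
    h, w = image_shape
    projection = np.zeros(image_shape, dtype=bool)
    for slice_k, H_k in zip(volume, homographies):
        ys, xs = np.nonzero(slice_k)  # occupied cells of this slice
        if xs.size == 0:
            continue
        pts = np.stack([xs, ys, np.ones_like(xs)]).astype(float)
        img = H_k @ pts
        u = np.rint(img[0] / img[2]).astype(int)
        v = np.rint(img[1] / img[2]).astype(int)
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # Integrate the patches: the projection is the union over all slices.
        projection[v[inside], u[inside]] = True
    return projection
```

Mapping only the surface points of each slice in the same way yields the 2D locations at which image features can be associated with 3D surface points.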

In  our  method,  the  3D  shapes  are  represented  using  binary  volumes  with  a  stack  of  slices  along 
the  reference  direction.  Thus,  the  surface  points  can  be  easily  obtained  by  applying  edge  detection 
techniques.  After  transforming  the  surface  points  to  2D  planes,  feature  vectors  computed  in  2D  can 
be  related  to  the  3D  points  according  to  their  locations.  That  is  the  way  a  3D  feature  model  is  built. 

The  training  images  in  our  work  come  from  two  sources.  One  set  of  images  is  taken  around 
a  specific  object  of  the  target  class  to  reconstruct  it  in  3D  as  shown  in  Fig.  2.  These  images  are 
called model views, which provide multiple views of the object but are limited to the specific object.
To  generalize  the  model  for  recognizing  other  objects  in  the  same  class,  another  set  of  training 
images  is  obtained  by  using  Google  image  search.  Images  of  objects  in  the  same  class  with  different 
appearances  and  postures  are  selected.  These  images  are  denoted  as  the  supplemental  views. 

Since the homographies between the supplemental images and the 3D model are unknown,
features computed from the supplemental images cannot be directly attached to the feature model.
Instead, we utilize the model views as bridges to connect the supplemental images to the model,
as illustrated in Fig. 2. For each supplemental image, the model view with the most similar
viewpoint is selected. The supplemental image is then deformed to its selected model view by an
affine transformation alignment, after which we assume that the supplemental image has the
same homography as that model view. The 2D features computed from all of the
supplemental training images can now be attached to the 3D model surface using the same
method as discussed for the model views. A codebook is constructed by combining all the mapped
features with their 3D locations.

Fig. 2. Construction of the 3D feature model for motorbikes. The 3D shape model of a motorbike (at center)
is constructed using the model views (images on the inner circle) taken around the object from
different viewpoints. Supplemental images (outer circle) of different motorbikes are obtained by
using Google's image search. The supplemental images are aligned with the model views for feature
mapping. Feature vectors are computed from all the training images and then attached to the 3D
model surface by using the homography transformation.


Object Class Detection. Given a new test image, our objective is to detect objects belonging to
the same class in this image by using the learnt 3D feature model $M$. Each entry of $M$ consists of
a code and its 3D locations, $\{c, l_c^3\}$. Let $F$ denote the SIFT features computed from the input image,
each of which is composed of a feature descriptor and its 2D location in the image, $\{f, l^2\}$. An object $O_n$ is
detected by matching the features $F$ to the 3D feature model $M$.

In  our  work,  feature  matching  is  achieved  in  three  phases.  In  the  first  phase,  we  match  the 
features  by  comparing  all  the  input  features  to  the  codebook  entries  in  Euclidean  space.  However, 
not  all  the  matched  codebook  entries  in  3D  are  visible  at  the  same  time  from  a  particular  viewpoint. 
So,  in  the  second  phase,  matched  codes  in  3D  are  projected  to  viewing  planes  and  hypotheses  of 
viewpoints  are  made  by  selecting  viewing  planes  with  the  largest  number  of  visible  points  projected. 
In  the  third  phase,  for  each  hypothesis,  the  projected  points  are  compared  to  2D  matched  feature 
points  using  both  feature  descriptors  and  locations.  This  is  done  by  iteratively  estimating  the  affine 
transformation  between  the  feature  point  sets  and  removing  the  outliers  with  large  distance  between 
corresponding points. Outliers belonging to the background can be rejected during this matching
process. The object location and bounding box are then determined according to the 2D locations of
the final matched feature points. The confidence of detection is given by the degree of match.

Fig. 3. Detection of motorbikes and horses using the proposed approach. The ground truth is shown
in green and red boxes display our detected results.
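
The third matching phase (iteratively estimating an affine transformation and removing outliers with large point distances) can be sketched as below. The report does not specify the estimator, the distance threshold, or the stopping rule; the least-squares fit, the fixed pixel `threshold`, and `max_iter` are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine fit, returned as a 2x3 matrix [A | t]."""
    X = np.hstack([src, np.ones((len(src), 1))])      # n x 3 design matrix
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)  # 3 x 2 solution
    return params.T

def apply_affine(M, pts):
    """Apply a 2x3 affine matrix to an n x 2 array of points."""
    return pts @ M[:, :2].T + M[:, 2]

def iterative_affine_match(src, dst, threshold=3.0, max_iter=10):
    """Alternate between fitting an affine map and discarding outliers.

    src: projected model points (n x 2); dst: matched image points (n x 2).
    threshold (in pixels) and max_iter are illustrative assumptions.
    """
    inliers = np.ones(len(src), dtype=bool)
    M = fit_affine(src, dst)
    for _ in range(max_iter):
        residual = np.linalg.norm(apply_affine(M, src) - dst, axis=1)
        new_inliers = residual < threshold
        # Stop when too few points remain or the inlier set is stable.
        if new_inliers.sum() < 3 or np.array_equal(new_inliers, inliers):
            break
        inliers = new_inliers
        M = fit_affine(src[inliers], dst[inliers])
    return M, inliers
```

The surviving inliers would then supply the 2D locations from which the bounding box is drawn, and their count or fit quality could serve as one possible measure of the degree of match.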


Experimental  Results.  Our  method  has  been  tested  on  two  object  classes:  motorbikes  and  horses. 
For  the  motorbikes,  we  took  23  model  views  around  a  motorbike  and  obtained  45  supplemental  views 
by  using  Google’s  image  search.  Some  training  images  of  the  motorbikes  and  the  3D  shape  model 
are  shown  in  Fig.  2.  For  the  horses,  18  model  views  were  taken  and  51  supplemental  views  were 
obtained. 

To  measure  the  performance  of  our  3D  feature  model  based  object  class  detection  technique,  we 
have  evaluated  the  method  on  the  PASCAL  VOC  Challenge  2006  test  dataset  [4],  which  has  become 
a  standard  testing  dataset  for  objective  evaluation  of  object  classification  and  detection  algorithms. 
The  dataset  is  very  challenging  due  to  the  large  variability  in  the  scale  and  poses,  the  extensive 
clutter,  and  poor  imaging  conditions.  Some  successful  detection  results  are  shown  in  Fig.  3.  The 
green  box  indicates  the  ground  truth,  while  our  results  are  shown  in  red  boxes. 

For quantitative evaluation, we adopt the same evaluation criterion used in the PASCAL VOC
challenge, so that our results are directly comparable with [4, 8, 16]. Under this criterion, a detection
is considered correct if the area of overlap between the predicted bounding box $B_p$ and the ground truth
bounding box $B_{gt}$ exceeds 50% according to the formula

$$\frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}. \qquad (5)$$
Fig. 4. The PR curves for (a) motorbike detection and (b) horse detection using our 3D feature model
based approach. The curves reported in [4] on the same test dataset are also included for comparison.


The  average  precision  (AP)  and  precision-recall  (PR)  curve  can  then  be  computed  for  performance 
evaluation. 
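
For reference, the overlap criterion in (5) can be computed as in the following sketch. The box representation (x_min, y_min, x_max, y_max) and the function names are assumptions made for illustration; they are not part of the PASCAL VOC tooling.

```python
def overlap_ratio(box_p, box_gt):
    """Intersection area over union area of two axis-aligned boxes.

    Boxes are given as (x_min, y_min, x_max, y_max), an assumed layout.
    """
    inter_w = max(0.0, min(box_p[2], box_gt[2]) - max(box_p[0], box_gt[0]))
    inter_h = max(0.0, min(box_p[3], box_gt[3]) - max(box_p[1], box_gt[1]))
    inter = inter_w * inter_h
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(box_p, box_gt, threshold=0.5):
    """A detection counts as correct when the overlap ratio exceeds 50%."""
    return overlap_ratio(box_p, box_gt) > threshold
```

For example, a predicted box shifted by half its width relative to an equally sized ground truth box overlaps by only one third and would be counted as a miss.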

Fig. 4(a) shows the PR curves of our approach and the methods in [8, 16] for motorbike detection.
The curve of our approach shows a substantial improvement in precision over the
method in [8], which is also indicated by the AP value (0.182). Although our performance is lower
than that of [16], this can be regarded as satisfactory considering the smaller training image set
used in our experiments. Fig. 4(b) shows the performance curves for horse detection. Since
no result was reported in the VOC challenge using researchers' own training dataset for this task, we
compared our result to those obtained using the provided training dataset. Our approach performs better than
the reported methods, obtaining an AP value of 0.144. The absolute performance level
is lower than that of motorbike detection, which might be caused by the non-rigid body deformation
of horses.

3  Correlation  Pattern  Recognition  Based  on  View  Morphing 

In  this  report,  we  describe  two  new  methods  for  synthesizing  new  views  of  a  known  object.  We 
have  shown  previously  that  given  a  set  of  paired  signatures  in  Y  and  X,  it  is  possible  to  model  the 
view  synthesis  process  as  a  linear  transform  A  such  that  AY  =  X.  In  fact,  the  minimum  squared  error 
solution  was  shown  to  be 

$$A = X Y^T (Y Y^T)^{-1}. \qquad (6)$$

Then,  if  y  (a  column  of  Y)  is  an  “observed”  signature,  the  matrix  A  can  be  used  to  obtain  the 
prediction  x  (a  column  of  X)  via  the  equation  Ay  =  x. 
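
A minimal NumPy sketch of the minimum squared error solution in (6) is given below. It assumes that the columns of Y and X are paired signatures and that $YY^T$ is invertible; the function name is hypothetical.

```python
import numpy as np

def lmse_transform(Y, X):
    """Return A = X Y^T (Y Y^T)^{-1}, the least-squares solution of AY = X.

    Y is d x N (observed signatures as columns) and X is p x N (paired
    target signatures). A linear solve is used instead of an explicit
    inverse; this relies on Y Y^T being symmetric and invertible.
    """
    return np.linalg.solve(Y @ Y.T, Y @ X.T).T
```

Given an observed signature `y`, the prediction is then simply `lmse_transform(Y, X) @ y`.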

3.1  Interpolated  Nearest  Neighbor 

The problem is that while the overall squared error is minimized, the performance at or near
individual signatures can be poor. To overcome this, we limit the signature pairs to the nearest neighbors
of the input signature. Thus, given an observed input signature (say $z$), we select the $M$ nearest
neighbor columns of $Y$ (denoted by the matrix $Z$) and the corresponding columns of $X$ (now denoted by
the matrix $U$) and write the equation:

AZ  =  U.  (7) 

The problem now is that $M$ is substantially smaller than the dimension of the vector space, and the
minimum squared error solution for $A$ cannot be computed since $ZZ^T$ is singular. One approach to
overcome this limitation is to avoid the explicit computation of $A$ altogether by assuming that the
observed input signature $y$ can be approximated as a weighted linear combination (or interpolation)
of its nearest neighbors, i.e.

$$y = Za. \qquad (8)$$

Of course, given $y$ and $Z$ it is easy to obtain the weights

$$a = (Z^T Z)^{-1} Z^T y. \qquad (9)$$

With this simplification, the estimation equation becomes

$$Ay = AZa = Ua. \qquad (10)$$

In other words, if the interpolation weight vector $a$ is known, the predicted response to $y$ is
simply the interpolation (weighted linear combination) of the columns of $U$. We therefore
do not actually need to know the transform matrix $A$, as long as we have an estimate for the weight
vector $a$ and the ideal predicted signatures in $U$.
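
The interpolated nearest neighbor prediction of (8) to (10) can be sketched as follows. The choice of Euclidean distance for selecting neighbors and the use of a generic least-squares solver for the weights in (9) are assumptions; the function name is hypothetical.

```python
import numpy as np

def interpolated_nn_predict(y, Y, X, M):
    """Predict the response to y from its M nearest neighbours.

    Y (d x N) holds the observed signatures as columns, X (p x N) the
    paired targets. Neighbours are chosen by Euclidean distance (an
    assumption; the report does not state the metric).
    """
    dists = np.linalg.norm(Y - y[:, None], axis=0)
    idx = np.argsort(dists)[:M]          # indices of the M closest columns
    Z, U = Y[:, idx], X[:, idx]
    # Weights a = (Z^T Z)^{-1} Z^T y, computed via a least-squares solve.
    a, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return U @ a                         # predicted signature, Ua
```

Note that the transform matrix A never appears: only the neighbor matrices Z and U and the weight vector are needed.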


3.2  Constrained  Optimization 

In  this  section  we  combine  the  advantages  of  the  original  LMSE  approach,  and  the  nearest  neighbor 
interpolation  technique.  Specifically,  we  require  A  to  minimize  MSE  across  all  the  data,  but  satisfy 
exact  relations  in  the  neighborhood  of  the  test  vector. 

Let $Y$ be the input data matrix and $X$ be the desired output data. We wish to find the linear
transform $A$ that satisfies the equation $AY = X$ in a minimum squared error (MSE) sense. In terms of the
columns $y_i$ and $x_i$ of the input and output data matrices, this can be written as minimizing

$$E = \sum_{i=1}^{N} |A y_i - x_i|^2. \qquad (11)$$

At the same time, we wish to exactly satisfy the linear relation in the immediate neighborhood of
the test input $z$. Let the columns of $Z$ represent the $M$ nearest neighbors of $z$, and let the corresponding
exact outputs be the columns of $U$. We then have the hard constraints $AZ = U$. We find $A$ by
solving for one row at a time, formulating a constrained minimization problem for each. Let $a^T$ be a row
of $A$, and $x^T$ be the corresponding row of $X$. The error can be written for each row as

$$E = |a^T Y - x^T|^2 = |Y^T a - x|^2 = a^T Y Y^T a + x^T x - 2 a^T Y x = a^T D a + x^T x - 2 a^T Y x, \qquad (12)$$

where $D = YY^T$. The constraints on $a^T$ are similarly written as

$$Z^T a = u, \qquad (13)$$

where $u$ is the corresponding row of $U$ (written as a column vector).


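Using the method of Lagrange multipliers, we form the functional

$$\Phi = a^T D a + x^T x - 2 a^T Y x - 2\lambda_1 (a^T z_1 - u_1) - 2\lambda_2 (a^T z_2 - u_2) - \cdots - 2\lambda_M (a^T z_M - u_M). \qquad (14)$$

The derivative of this with respect to $a$ is

$$\nabla_a \Phi = 2Da - 2Yx - 2(\lambda_1 z_1 + \lambda_2 z_2 + \cdots + \lambda_M z_M) = 2Da - 2Yx - 2Z\lambda, \qquad (15)$$

where $\lambda = [\lambda_1, \ldots, \lambda_M]^T$. Setting this to zero, we get

$$Da = Yx + Z\lambda \qquad (16)$$

or

$$a = D^{-1}(Yx + Z\lambda). \qquad (17)$$

We then substitute this in the constraint equation to get

$$u = Z^T D^{-1}(Yx + Z\lambda) = Z^T D^{-1} Y x + Z^T D^{-1} Z \lambda. \qquad (18)$$

Therefore,

$$\lambda = (Z^T D^{-1} Z)^{-1} (u - Z^T D^{-1} Y x). \qquad (19)$$

Finally, the expression for $a$ is obtained as

$$a = D^{-1} Y x + D^{-1} Z (Z^T D^{-1} Z)^{-1} (u - Z^T D^{-1} Y x). \qquad (20)$$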
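
The row-wise closed form for the constrained solution can be applied to all rows of A at once by stacking them. The sketch below does this under the assumptions that $D = YY^T$ is invertible (which requires at least as many training columns as signature dimensions) and that the columns of Z are linearly independent; the function name is hypothetical.

```python
import numpy as np

def constrained_lmse(Y, X, Z, U):
    """Minimise |AY - X|^2 subject to the hard constraints AZ = U.

    Y: d x N inputs, X: p x N targets, Z: d x M neighbours, U: p x M exact
    outputs. Each row of the result is the constrained row solution,
    with all rows computed together.
    """
    D = Y @ Y.T
    D_inv_Y = np.linalg.solve(D, Y)          # D^{-1} Y
    D_inv_Z = np.linalg.solve(D, Z)          # D^{-1} Z
    A0_T = D_inv_Y @ X.T                     # unconstrained solution, transposed
    G = Z.T @ D_inv_Z                        # Z^T D^{-1} Z
    lam = np.linalg.solve(G, U.T - Z.T @ A0_T)  # Lagrange multipliers per row
    return (A0_T + D_inv_Z @ lam).T
```

The first term is the ordinary LMSE solution and the second is a correction that forces the neighbors in Z to map exactly onto U, which is how the approach combines the two earlier methods.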


3.3  Examples  and  Comparisons 

In  this  section  we  illustrate  and  compare  the  previous  LMSE  technique,  the  interpolated  nearest 
neighbor  method,  and  the  new  constrained  optimization  approach.  We  present  several  cases  below 
where  a  group  of  four  images  on  the  left  show  the  performance  of  the  prediction  algorithms,  and 
the input test image is shown on the right. In each case, the top left corner is the “ideal” image
to  be  predicted.  The  top  right  is  the  result  of  the  previous  LMSE  approach,  the  bottom  left  is  the 
output  of  the  nearest  neighbor  method,  and  the  bottom  right  is  the  prediction  obtained  using  the 
constrained  technique.  For  each  predicted  image,  the  normalized  similarity  to  the  ideal  image  is 
shown  in  the  title  (larger  is  better).  For  example  in  case  1  below,  the  LMSE  approach  achieves  a 
similarity  of  0.75,  and  the  nearest  neighbor  approach  produces  a  worse  result  with  similarity  of  0.55. 
The  constrained  optimization  technique  performs  the  best  by  producing  an  image  which  has  0.79 
similarity  to  the  ideal  image.  The  same  is  found  to  be  true  for  the  other  3  cases  as  well  where  the 
constrained  optimization  technique  performs  the  best. 